Abstract



This paper introduces a probabilistic graphical model for continuous action recognition with two novel components: a
substructure transition model and a discriminative boundary model. The first component encodes the sparse and global temporal
transition prior between action primitives in a state-space model to handle the large spatial-temporal variations within an
action class. The second component enforces the action duration constraint in a discriminative way to locate the transition
boundaries between actions more accurately. The two components are integrated into a unified graphical structure to enable
effective training and inference. Our experimental results on both public and in-house datasets show that, by incorporating
additional information that had not been explicitly or efficiently modeled by previous methods, the proposed algorithm
achieves significantly improved performance for continuous action recognition.

1. Introduction

Understanding continuous human activities from videos, i.e. simultaneous segmentation and classification of actions, is
a fundamental yet challenging problem in computer vision. Many existing works approach the problem using bottom-up
methods [32], where segmentation is performed as preprocessing to partition videos into coherent constituent parts, and
action recognition is then applied as an isolated classification step. Although a rich literature exists for segmentation of time
series, such as change point detection [ ], periodicity of cyclic events modeling [ ] and frame clustering [ ], these methods
tend to detect local boundaries and lack the ability to incorporate global dynamics of temporal events. This leads to under-
or over-segmentation that severely affects the recognition performance, especially for complex actions with diversified local
motion statistics [13].

The limitation of the bottom-up approaches has been addressed by performing concurrent top-down recognition using
variants of the Dynamic Bayesian Network (DBN), where the dynamics of temporal events are modeled as transitions in a
latent [25, 18] or partially observed state space [14, 28]. The technique has been successfully used in speech recognition
and natural language processing, while the performance of existing DBN based approaches for action recognition [ , 10,
33, 34, 17, 27] tends to be relatively lower [13], mostly due to the difficulty in interpreting the physical meaning of latent
states. Thus, it becomes difficult to impose additional prior knowledge with clear physical meaning on an existing graphical
structure to further improve its performance.

To tackle the problem, in this paper we show how two additional sources of information with clear physical interpretations
can be considered in a general graphical structure for the state-space model (SSM) in Figure 1. Compared to a standard Switching
Linear Dynamic System (SLDS) [ ] model in Figure 1(a), where X, Y and S are respectively the hidden state, observation
and label, the proposed model in Figure 1(b) is augmented with two additional nodes, Z and D, to describe the substructure
transition and duration statistics of actions:

Substructure transition Rather than a uniform motion type, a real-world human action is usually characterized by a set of
inhomogeneous units with some intrinsic structure, which we call substructure. Action substructure arises from two factors:
(1) the hierarchical nature of human activity, where one action can be temporally decomposed into a series of primitives
with spatial-temporal constraints; (2) the large variance of action dynamics due to differences in the kinematic properties of
subjects, feedback from the environment, or interaction with objects. For the first factor, Hoai et al. [ ] used a multi-class
Support Vector Machine (SVM) with Dynamic Programming to recognize coherent motion constituent parts in an action;
Liu et al. [ ] applied latent SVM to the temporal evolution of "attributes" in actions; Sung et al. [ ] introduced a two-layer
Maximum Entropy Markov Model to recognize the correspondence between sub-activities and human skeletal features.
For the second factor, attention has been paid to the substructure variance caused by subject-object interaction using
the Connected Hierarchic Conditional Random Field (CRF) [17], and the substructure variance caused by pose using the Latent Pose
CRF [27].

Figure 1. (a) Traditional SLDS model for continuous action recognition, where each action is represented by an LDS; (b) the structure of our
proposed model, in which each action is represented by an SLDS with substructure transition, and the inter-action transition is controlled
by the discriminative boundary model.

In more general cases, Morency et al. presented the Latent Dynamic CRF (LDCRF) algorithm by adding a "latent-dynamic"
layer into CRF for hidden substructure transition [ ]. The limitation of CRF as a discriminative method is that
a single pseudo-likelihood score is estimated for an entire sequence, which cannot be interpreted as the probability of each
individual frame. To solve the problem, we instead design a generative model as in Figure 1(b), with an extra hidden node
Z gating the transition amongst a set of dynamic systems, so that the posterior for every action can be inferred strictly under
the Bayesian framework for each frame. The dimension of the state space increases geometrically with an extra hidden node, so we
introduce effective transition prior constraints in Section 2 to avoid over-fitting on a limited amount of training data.

Duration model The duration statistics of actions are important in determining the boundary where one action transits
to another in continuous recognition tasks. Duration models have been widely adopted in Hidden Markov Model (HMM)
based methods, such as the explicit duration HMM [ ] or more generally the Hidden Semi Markov Model (HSMM) [39].
Incorporating a duration model into SSM is more challenging than into HMM because SSM has a continuous state space, and exact
inference in SSM is usually intractable [ ]. Some works reported in this line include Cemgil et al. [ ] for music transcription
and Chib and Dueker [6] for economics. Oh et al. [29] imposed the duration constraint at the top level of SLDS and achieved
improved performance for honeybee behavior analysis [ ]. In general, naive integration of a duration model into SSM is not
effective, because duration patterns vary significantly across visual data and limited training samples may bias the model
with incorrect duration patterns.

To address this problem, in Figure 1(b) we correlate the duration node D with the continuous hidden state node X and
the substructure transition node Z via logistic regression as explained in Section 3. In this way, the proposed duration
model becomes more discriminative than conventional generative models, and the data-driven boundary locating process can
accommodate more variation in duration length.

In summary, the major contribution of the paper is to incorporate two additional models into a general SSM, namely the
Substructure Transition Model (STM) and the Discriminative Boundary Model (DBM). We also design a Rao-Blackwellised
particle filter for efficient inference of the proposed model in Section 4. Experiments in Section 5 demonstrate the superior
performance of our proposed system over several existing state-of-the-art methods in continuous action recognition. Conclusions are
drawn in Section 6.



2. Substructure Transition Model

Linear Dynamic Systems (LDS) are the most commonly used SSM to describe visual features of human motions. An LDS is
modeled by linear Gaussian distributions:

$$p(Y_t = y_t \mid X_t = x_t) = \mathcal{N}(y_t; B x_t, R) \tag{1}$$

$$p(X_{t+1} = x_{t+1} \mid X_t = x_t) = \mathcal{N}(x_{t+1}; A x_t, Q) \tag{2}$$

where $Y_t$ is the observation at time $t$, $X_t$ is a latent state, and $\mathcal{N}(x; \mu, \Sigma)$ is the multivariate normal distribution of $x$ with mean $\mu$
and covariance $\Sigma$. To consider multiple actions, SLDS [ ] is formulated as a mixture of LDSs with the switching among
them controlled by the action class $S_t$. However, each LDS can only model an action with homogeneous motion, ignoring the
complex substructure within the action. We introduce a discrete hidden variable $Z_t \in \{1, \dots, N_Z\}$ to explicitly represent such
information, and the substructured SSM can be stated as:

$$p(Y_t = y_t \mid X_t = x_t, S_t^i, Z_t^j) = \mathcal{N}(y_t; B^{ij} x_t, R^{ij}) \tag{3}$$

$$p(X_{t+1} = x_{t+1} \mid X_t = x_t, S_{t+1}^i, Z_{t+1}^j) = \mathcal{N}(x_{t+1}; A^{ij} x_t, Q^{ij}) \tag{4}$$

where $S_t^i$ and $Z_t^j$ abbreviate $S_t = i$ and $Z_t = j$, and $A^{ij}$, $B^{ij}$, $Q^{ij}$, and $R^{ij}$ are the LDS parameters for the $j$-th action primitive
in the substructure of the $i$-th action class. $\{Z_t\}$ is modeled as a Markov chain and the transition probability is specified by
a multinomial distribution:

$$p(Z_{t+1}^k \mid Z_t^j, S_{t+1}^i) = \theta_{ijk} \tag{5}$$

Figure 2. STM trained for action "move-arm" in the stacking dataset using left-to-right (a), sparse (b), and block-wise sparse (c) constraints,
with $N_Z = 5$ and $N_Q = 3$. The STM in (c) captures global ordering and local details better than the other two.

In the following, the term STM may refer to either the transition matrix in Eq. (5) or the overall substructured SSM depending 
on its context. Some examples of STM are given in Fig. 2, which are to be explained in detail in the remainder of this section. 
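
As a minimal illustration of Eqs. (3)-(5), the sketch below draws a sequence from a substructured SSM for a single action class. It is not the implementation used in our experiments: the state and observation are scalars, the chain is assumed to start in primitive 0 with state 1.0, and the names `sample_categorical` and `simulate_stm` are hypothetical.

```python
import random


def sample_categorical(probs, rng):
    """Draw an index from a probability vector."""
    u, acc = rng.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if u < acc:
            return k
    return len(probs) - 1


def simulate_stm(theta, a, b, q_std, r_std, steps, seed=0):
    """Simulate a scalar substructured LDS for one action class.

    theta[j][k] is the primitive transition probability of Eq. (5);
    a[j] and b[j] play the role of A and B for primitive j (assumed scalar).
    """
    rng = random.Random(seed)
    z, x = 0, 1.0  # assumed initial primitive and state
    zs, ys = [], []
    for _ in range(steps):
        z = sample_categorical(theta[z], rng)      # primitive transition, Eq. (5)
        x = a[z] * x + rng.gauss(0.0, q_std)       # state dynamics, Eq. (4)
        y = b[z] * x + rng.gauss(0.0, r_std)       # observation, Eq. (3)
        zs.append(z)
        ys.append(y)
    return zs, ys
```

With a left-to-right `theta`, the sampled primitive index can only stay or advance, which is the behaviour the constraints discussed next are meant to relax.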

2.1. Sparsity Constrained STM 

We use the simplified notation $\Theta = \{\theta_{ij}\}$ for the STM within a single action. In general, $\Theta$ can be any matrix as long
as each row of it is a probability vector, which allows the substructure of action primitives to be organized arbitrarily. For
most real-world human actions, however, there is a strong temporal ordering associated with the primitive units. Such an order
relationship can be vital to accurate action recognition, since a different temporal ordering can define a totally different action
even if the composing primitive units are the same. Moreover, if prior information about the substructure is incorporated in model
estimation, the learning process can become more robust to noise and outliers.

There have been some attempts to characterize the order relationship of primitive units by restricting the structure of the
transition matrix $\Theta$. For example, the left-to-right HMM [ ] is proposed to model sequential or cyclic ordering of primitive
units, and the corresponding $\Theta$ has non-zero values only on or directly above the diagonal, as illustrated in Fig. 2 (a). Sequential
ordering is a very strong assumption; to describe actions with more flexible temporal patterns, people have resorted to the
switching HMM [ ] or factorial HMM [ ], which model action variations with multiple sequential orderings. All the
works above assume the order relationship between action units is given a priori; i.e., the number of non-zero entries
in $\Theta$ is small and their locations are all known. In many cases, however, it is difficult to specify such information exactly,
and making a wrong assumption can bias the estimation of the action model. A more practical approach is to impose a sparse
transition constraint while leaving the discovery of the exact order relationship to the training phase. Along this direction, the negative
Dirichlet distribution has been proposed in [ ] as a prior for each row $\theta_i$ in $\Theta$:



$$p(\theta_i) \propto \prod_j \theta_{ij}^{-\alpha} \tag{6}$$

where $\alpha$ is a pseudo count penalty. The MAP estimate of the parameter is

$$\hat{\theta}_{ij} = \frac{\max(\xi_{ij} - \alpha, 0)}{\sum_l \max(\xi_{il} - \alpha, 0)} \tag{7}$$

where $\xi_{ij}$ is the sufficient statistic of $(Z_t^i, Z_{t+1}^j)$. When the number of transitions from $z^i$ to $z^j$ in the training data is less than
$\alpha$, the probability $\theta_{ij}$ is set to zero. The sparsity enforced in this way often leads to local transition patterns which might
actually be caused by noise or incomplete data, as shown in Fig. 2 (b). Also, the penalty term $\alpha$ introduces bias to the proportion
of non-zero transition probabilities, i.e. $\hat{\theta}_{ij} / \hat{\theta}_{ik} \neq \xi_{ij} / \xi_{ik}$. This bias can be severe especially when $\xi_{ij}$ is small.
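
The estimate in Eq. (7) and the bias it introduces can be checked on a single row of counts. The sketch below is illustrative only; the function name `sparse_map_row` and the example counts are assumptions, not values from our experiments.

```python
def sparse_map_row(counts, alpha):
    """MAP estimate of one transition row under the negative Dirichlet prior, Eq. (7)."""
    clipped = [max(c - alpha, 0.0) for c in counts]
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("every count is at or below the penalty alpha")
    return [c / total for c in clipped]


# Hypothetical transition counts out of one primitive.
counts = [8.0, 3.0, 1.0]
row = sparse_map_row(counts, alpha=2.0)
# The rare transition is pruned, and the ratio between the two surviving
# entries (6:1) no longer matches the ratio of the raw counts (8:3).
```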

2.2. Block-wise Sparse STM 

As we have seen, the sequential order assumption about the transition between action units is too strong, while the sparse
prior on transition probability is biased and cannot globally regularize the STM. Here we propose a block-wise sparse STM
which achieves a tradeoff between model sparsity and flexibility. The idea is to divide an action into several stages, where each
stage comprises a subset of action primitives. The transition between stages is encouraged to be sequential but sparse, such
that the global action structure can be modeled. At the same time, the action primitives within each stage can propagate freely
from one to another so that variation in action styles and parameters is also preserved. Our stage-wise transition model is also
favorable for continuous action segmentation, since the starting and terminating stages can be explicitly modeled to
enhance discrimination on action boundaries.

Formally, define the discrete variable $Q_t \in \{1, \dots, N_Q\}$ as the current stage index of the action, and assume a surjective mapping
$g(\cdot)$ is given which assigns each action primitive $Z_t$ to its corresponding stage $Q_t$:

$$\begin{cases} p(Q_t^q, Z_t^i) > 0, & \text{if } g(i) = q \\ p(Q_t^q, Z_t^i) = 0, & \text{otherwise} \end{cases} \tag{8}$$

The choice of $g(\cdot)$ depends on the nature of the action. Intuitively, we can assign more action primitives to a stage with diversified
motion patterns and fewer action primitives to a stage with a restricted pattern. The joint dynamic transition distribution of
$Q_t$ and $Z_t$ is defined as:

$$p(Q_{t+1}, Z_{t+1} \mid Q_t, Z_t) = p(Q_{t+1} \mid Q_t) \, p(Z_{t+1} \mid Q_{t+1}, Z_t) \tag{9}$$

The second term of Eq. (9) specifies the transition between action primitives, which we want to keep as flexible as possible
to model diversified local action patterns. The first term captures the global structure between different action stages, and
therefore we impose an ordered negative Dirichlet distribution as its hyper-prior:

$$p(\Phi) \propto \prod_{q \neq r,\; q+1 \neq r} \phi_{qr}^{-\alpha} \tag{10}$$

where $\Phi = \{\phi_{qr}\}$ is the stage transition probability matrix, $\phi_{qr} = p(Q_{t+1}^r \mid Q_t^q)$, and $\alpha$ is a constant pseudo count
penalty. The ordered negative Dirichlet prior encodes both sequential order information and the sparsity constraint. It statistically
promotes a global transition path $Q^1 \to Q^2 \to \dots \to Q^{N_Q}$ which can be learned from training data rather than heuristically
defined as in the left-to-right HMM [ ]. An example of the resulting STM is shown in Fig. 2 (c). Note that no incoming/outgoing
transition is encouraged for $Q^1$/$Q^{N_Q}$, which stand for the starting/terminating stages. The identification of these two
special stages is helpful for segmenting continuous actions, as will be discussed in Sec. 3.2.

2.3. Learning STM 

The MAP model estimation requires maximizing the product of the likelihood (9) and the prior (10) under the constraint of
(8). There are two interdependent nodes, Q and Z, involved in the optimization, which makes the problem complicated.
Fortunately, as shown in Appendix A, Eq. (9) can be replaced with the transition distribution of the single variable Z, and a
constraint exists for the relationship between $\Theta$ and $\Phi$. Therefore, the node Q (and the associated parameter $\Phi$) serves
only a conceptual purpose and can be eliminated in our model construction. The MAP estimation can be converted to the
following constrained optimization problem:

$$\max_{\Theta} \; \mathcal{L}(\Theta) = \sum_{i,j} \xi_{ij} \log \theta_{ij} - \sum_{q \neq r,\; q+1 \neq r} \alpha \log \phi_{qr} \tag{11}$$

$$\text{s.t.} \quad \phi_{qr} = \sum_{j \in \mathcal{G}(r)} \theta_{ij}, \;\; i \in \mathcal{G}(q), \;\; \forall q, r; \qquad \sum_j \theta_{ij} = 1, \;\; \forall i; \qquad \theta_{ij} \ge 0, \;\; \forall i, j$$



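where $\xi_{ij}$ is the sufficient statistic of $(Z_t^i, Z_{t+1}^j)$, $\mathcal{G}(q) = \{i \mid g(i) = q\}$, and $\{\phi_{qr}\}$ are just auxiliary variables. The KKT
(Karush-Kuhn-Tucker) conditions for the optimal solution are:

$$-\frac{\xi_{ij}}{\theta_{ij}} - \lambda_{i,g(j)} + \gamma_i - \mu_{ij} = 0, \quad \forall i, j$$

$$\frac{\alpha_{qr}}{\phi_{qr}} + \sum_{i \in \mathcal{G}(q)} \lambda_{i,r} = 0, \quad \forall q, r$$

$$\mu_{ij} \ge 0, \quad \mu_{ij} \theta_{ij} = 0, \quad \forall i, j$$

where $\lambda_{i,r}$, $\gamma_i$, and $\mu_{ij}$ are constant multipliers; $\alpha_{qr}$ is equal to $\alpha$ if $q \neq r$ and $q + 1 \neq r$, and 0 otherwise. Solving the equation
set as in Appendix B gives the MAP parameter estimate:

$$\hat{\theta}_{ij} = \frac{\hat{\phi}_{g(i),g(j)} \, \xi_{ij}}{\sum_{j' \in \mathcal{G}(g(j))} \xi_{ij'}} \tag{12}$$

where $\hat{\phi}_{qr}$ denotes the estimated stage transition probability. As we can see, the resultant transition matrix is a block-wise
sparse matrix, which can characterize both the global structure and the local detail of action dynamics. Also, within each block
(stage), there is no bias in $\theta_{ij}$.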
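
The block-wise estimate can be sketched in a few lines. The code below is a hedged illustration rather than the procedure of Appendix B: the stage-level estimate (pooled counts minus the penalty, clipped at zero and renormalised) is our own reading of the KKT conditions and should be treated as an assumption, as should the uniform fallback for rows with no counts in a block and the names `blockwise_sparse_map` and `stage_of`.

```python
def blockwise_sparse_map(counts, stage_of, alpha):
    """Block-wise sparse transition matrix from primitive transition counts.

    counts[i][j]: number of transitions from primitive i to primitive j.
    stage_of[i]:  0-based stage index g(i) of primitive i.
    The stage-level step below is an assumed form, not quoted from the text.
    """
    n, n_stage = len(counts), max(stage_of) + 1
    theta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q = stage_of[i]
        # Pool counts over every primitive of stage q, grouped by target stage.
        pooled = [0.0] * n_stage
        for i2 in range(n):
            if stage_of[i2] == q:
                for j in range(n):
                    pooled[stage_of[j]] += counts[i2][j]
        # Penalise every stage transition except q -> q and q -> q + 1.
        clipped = [max(pooled[r] - (0.0 if r in (q, q + 1) else alpha), 0.0)
                   for r in range(n_stage)]
        total = sum(clipped)
        if total == 0.0:
            raise ValueError("no admissible stage transition for stage %d" % q)
        phi = [c / total for c in clipped]
        # Within a block, probabilities stay proportional to the raw counts.
        for r in range(n_stage):
            members = [j for j in range(n) if stage_of[j] == r]
            mass = sum(counts[i][j] for j in members)
            for j in members:
                if mass > 0.0:
                    theta[i][j] = phi[r] * counts[i][j] / mass
                else:
                    theta[i][j] = phi[r] / len(members)  # assumed fallback
    return theta
```

In a two-stage example, backward transitions from the second stage to the first are pruned once their pooled count falls below the penalty, while the entries inside each surviving block keep the proportions of the counts.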

3. Discriminative Boundary Model 

3.1. Logistic Duration Model 

It is straightforward to use a Markov chain to model the transition of action $S_t$, where $p(S_{t+1}^j \mid S_t^i) = a_{ij}$. The duration
information of the $i$-th action is naively incorporated into its self-transition probability $a_{ii}$, which leads to an action duration
model with an exponential (geometric) distribution:

$$p(dur_i = \tau) = a_{ii}^{\tau - 1} (1 - a_{ii}), \quad \tau = 1, 2, 3, \dots$$

Unfortunately, only a limited number of real-life events have an exponentially diminishing duration. Inaccurate duration 
modeling can severely affect our ability to segment consecutive actions and identify their boundaries. 

Non-exponential duration distribution can be implemented with duration-dependent transition matrix, such as the one used 
in HSMM [ ]. Fitting a transition matrix for each epoch within the maximum length of duration is often impossible given a 
limited number of training sequences, even when parameter hyperprior such as hierarchical Dirichlet distribution [ ] is used 
to restrict model freedom. Parametric duration distributions such as gamma [ ] and Gaussian [ ] provide a more compact 
way to represent duration and show good performance in signal synthesis. However, they are less useful in inference because 
the corresponding transition probability is not easy to evaluate. 

Here a new logistic duration model is proposed to address the above limitations. We introduce a variable $D_t$ to represent
the length of time the current action has lasted. $\{D_t\}$ is a counting process starting from 1, and the beginning of a new
action is triggered whenever it is reset to 1:

$$p(S_{t+1}^j \mid S_t^i, D_{t+1}^d) = \begin{cases} \delta(j - i), & d > 1 \\ a_{ij}, & d = 1 \end{cases} \tag{13}$$

where $a_{ij}$ is the probability of transiting from previous action $i$ to new action $j$. Notice that the same type of action can be
repeated if we have $a_{ii} > 0$.

Instead of modeling the action duration distribution directly, we model the transition distribution of $D_t$ as a logistic function
of its previous value:

$$p(D_{t+1}^c \mid S_t^i, D_t^d) = \begin{cases} \dfrac{e^{\nu_i (d - \beta_i)}}{1 + e^{\nu_i (d - \beta_i)}}, & c = 1 \\ \dfrac{1}{1 + e^{\nu_i (d - \beta_i)}}, & c = d + 1 \\ 0, & \text{otherwise} \end{cases} \tag{14}$$

$$p(D_1^c) = \delta(c - 1) \tag{15}$$

where $\nu_i$ and $\beta_i$ are positive logistic regression weights. Eq. (14) immediately leads to the duration distribution for action
class $i$:

$$p(dur_i = \tau) = \prod_{d=1}^{\tau} \frac{1}{1 + e^{\nu_i (d - \beta_i)}} \times e^{\nu_i (\tau - \beta_i)} \tag{16}$$

Fig. 3 (a) shows how the resetting probability of $D_{t+1}$ changes as a function of $D_t$ with different parameter sets, and the
corresponding duration distributions are plotted in (b). The increasing probability of transiting to a new action leads to a
peaked duration distribution, with center and width controlled by $\beta_i$ and $\nu_i$, respectively. Our logistic duration model can be
easily extended to represent multiple-mode durations if a double logistic function [22] is used.
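
The peaked shape of the logistic duration distribution can be reproduced numerically. The sketch below is illustrative and assumes that an action of duration tau survives tau - 1 steps without a reset and is reset after the tau-th step; the function names and the parameter values in the usage lines are hypothetical.

```python
import math


def reset_prob(d, nu, beta):
    """Probability that the duration counter is reset to 1 after lasting d frames."""
    return 1.0 / (1.0 + math.exp(-nu * (d - beta)))


def duration_pmf(nu, beta, max_tau):
    """Duration distribution implied by the logistic reset probability.

    pmf[tau - 1] = P(no reset at d = 1..tau-1) * P(reset at d = tau).
    """
    pmf, survive = [], 1.0
    for tau in range(1, max_tau + 1):
        r = reset_prob(tau, nu, beta)
        pmf.append(survive * r)
        survive *= 1.0 - r
    return pmf


# Hypothetical parameters: reset probability crosses 0.5 at 20 frames.
pmf = duration_pmf(nu=0.5, beta=20.0, max_tau=200)
mode = pmf.index(max(pmf)) + 1  # most likely duration, close to beta
```

A larger `nu` sharpens the logistic transition and therefore narrows the peak, whereas the geometric model of a plain Markov chain always has its mode at a duration of one frame.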

3.2. Discriminative Boundary Model 

The logistic duration model can be integrated with STM by stacking the two layers of nodes (D-S and Z-X-Y) together.
The resultant generative model, however, is unable to utilize contextual information for accurate action boundary segmentation.
Discriminative graphical models, such as MEMM [ ] and CRF [ ], are generally more powerful in such classification
problems, except that they ignore data likelihood or suffer from the label bias problem.

To integrate discriminating power into our action boundary model and at the same time keep the generative nature of the 
action model itself, we construct DBM by further augmenting the duration distribution (which triggers action boundary) with 
the contextual information from latent states X and Z: 



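$$p(D_{t+1}^1 \mid S_t^i, D_t^d, X_t^x, Z_t^j) = \frac{e^{\nu_i (d - \beta_i) + \omega_{ij}^T x}}{1 + e^{\nu_i (d - \beta_i) + \omega_{ij}^T x}} \tag{17}$$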



where $\nu_i$ and $\beta_i$ have the same meaning as in Eq. (14), and $\omega_{ij}$ are the additional logistic regression coefficients. When
$\omega_{ij} = 0$, no information can be learned from $X_t$ and $Z_t$, and the DBM reduces to a generative one as in Eq. (14). A similar
logistic function has been employed in augmented SLDS [ ], where the main motivation is to distinguish between transitions
to different states based on latent variables. Our DBM is specifically designed for locating the boundary between contiguous
actions. It relies on both real-valued and categorical inputs.

As constrained by the STM in Subsection 2.2, each action is only likely to terminate in stage $N_Q$. Therefore, $D_{t+1}$ can be
reset to 1 only when the current action is in this terminating stage, and we can modify Eq. (17) as:



$$p(D_{t+1}^1 \mid S_t^i, D_t^d, X_t^x, Z_t^j) = \begin{cases} \text{Eq. (17)}, & g(j) = N_Q \\ 0, & \text{otherwise} \end{cases} \tag{18}$$



In this way, the number of parameters is greatly reduced and the label imbalance problem is also ameliorated. Now, the
construction of our action model for continuous recognition has been completed, with the overall structure shown in Figure 1
(b).
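
The reset probability of the DBM, including the restriction to the terminating stage, can be written as a short function. This is a sketch under stated assumptions: it handles a single action class, represents the state as a plain list, and uses hypothetical names (`dbm_reset_prob`, `stage_of`, `terminal_stage`).

```python
import math


def dbm_reset_prob(d, x, j, nu, beta, omega, stage_of, terminal_stage):
    """Probability that the duration counter resets, following Eqs. (17)-(18).

    d: current duration count; x: latent state vector (list of floats);
    j: current action primitive; omega[j]: regression weights for primitive j.
    """
    if stage_of[j] != terminal_stage:
        return 0.0  # Eq. (18): no boundary outside the terminating stage
    score = nu * (d - beta) + sum(w * xi for w, xi in zip(omega[j], x))
    return 1.0 / (1.0 + math.exp(-score))  # Eq. (17)
```

Setting every entry of `omega` to zero recovers the purely duration-driven reset probability, so the contextual term can only shift the boundary decision that the duration model would otherwise make.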

3.3. Learning DBM 

To learn the parameters $\nu$, $\beta$ and $\omega$, we use the coordinate descent method to iterate between $\{\nu, \beta\}$ and $\omega$. For $\nu$ and $\beta$,
given a set of $N$ training sequences with class labels $\{S^{(n)}\}_{n=1 \dots N}$, we can easily obtain the values for all duration nodes
$\{D^{(n)}\}_{n=1 \dots N}$ according to Eq. (13)-(15). Then fitting the parameters $\nu$ and $\beta$ is equivalent to performing logistic regression
with input-output pairs $\left(D_t^{(n)}, \delta(S_{t+1}^{(n)} - S_t^{(n)})\right)$. The action transition probability $\{a_{ij}\}$ can be obtained trivially.

To estimate $\omega_{ij}$, let $\{T^{(n)}\}_{n=1 \dots N}$ be our training set, where each data sample $T^{(n)}$ is a realization of all the nodes
involved in Eq. (17) at a particular time instance $t^{(n)}$ with $S^{(n)} = i$. Since $X^{(n)}$ and $Z^{(n)}$ are hidden variables, their
posteriors $p(Z^{(n)} = j \mid \cdot) = p_j^{(n)}$ and $p(X^{(n)} \mid Z^{(n)} = j, \cdot) = \mathcal{N}(x; \mu_j^{(n)}, \Sigma_j^{(n)})$ are first inferred from the single-action STM, where
the posterior of $X^{(n)}$ is approximated by a Gaussian. The estimate of $\omega_{ij}$ is obtained by maximizing the expected log
likelihood:

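$$\hat{\omega}_{ij} = \arg\max_{\omega_{ij}} \sum_n \mathbb{E}_{p(X^{(n)}, Z^{(n)} = j \mid \cdot)} \left[ \log f^{(n)}(x, \omega_{ij}) \right] = \arg\max_{\omega_{ij}} \sum_n p_j^{(n)} \int \log f^{(n)}(x, \omega_{ij}) \, \mathcal{N}(x; \mu_j^{(n)}, \Sigma_j^{(n)}) \, dx \tag{19}$$

where

$$f^{(n)}(x, \omega) = \frac{e^{b^{(n)} (c^{(n)} + \omega^T x)}}{1 + e^{c^{(n)} + \omega^T x}} \tag{20}$$

and $b^{(n)} = \mathbb{1}(D_{t+1}^{(n)} = 1)$, $c^{(n)} = \nu_i (D^{(n)} - \beta_i)$. The integral in Eq. (19) cannot be solved analytically. Instead, we use the
unscented transform [ ] to approximate the integral with the average over a set of sigma points of $\mathcal{N}(x; \mu_j^{(n)}, \Sigma_j^{(n)})$:

$$\hat{\omega}_{ij} = \arg\max_{\omega_{ij}} \sum_n \sum_{k=0}^{2M} \frac{p_j^{(n)}}{2M + 1} \log f^{(n)}(x_k^{(n)}, \omega_{ij})$$

$$x_k^{(n)} = \begin{cases} \mu_j^{(n)}, & k = 0 \\ \mu_j^{(n)} + \left(\sqrt{M \Sigma_j^{(n)}}\right)_k, & k = 1, \dots, M \\ \mu_j^{(n)} - \left(\sqrt{M \Sigma_j^{(n)}}\right)_{k-M}, & k = M + 1, \dots, 2M \end{cases}$$

where $M$ is the dimension of $x$, and $(\sqrt{\Sigma})_k$ is the $k$-th column of the matrix square root of $\Sigma$. Therefore, Eq. (19) converts to a
weighted logistic regression problem with features $\{x_k^{(n)}\}$, labels $\{b^{(n)}\}$ and weights $\{p_j^{(n)} / (2M + 1)\}$.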
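
The sigma-point construction can be sketched as follows. This is an illustration under assumptions rather than our implementation: the Cholesky factor is taken as the matrix square root (any factor whose product with its transpose equals the scaled covariance would serve), the centre point is included as the first point, and the helper names are hypothetical.

```python
import math


def cholesky(s):
    """Lower-triangular factor L with L L^T = s, for a symmetric positive definite s."""
    n = len(s)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(s[i][i] - acc)
            else:
                low[i][j] = (s[i][j] - acc) / low[j][j]
    return low


def sigma_points(mu, sigma):
    """Return 2M + 1 sigma points: the mean, then mean +/- columns of sqrt(M * sigma)."""
    m = len(mu)
    root = cholesky([[m * v for v in row] for row in sigma])
    columns = [[root[r][k] for r in range(m)] for k in range(m)]
    points = [list(mu)]
    points += [[a + c for a, c in zip(mu, col)] for col in columns]
    points += [[a - c for a, c in zip(mu, col)] for col in columns]
    return points
```

Each training sample then contributes its sigma points as features of a weighted logistic regression, all carrying the label of the sample and an equal share of its posterior weight.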

4. Inference with Rao-Blackwellised Particle Filter 

In testing, given an observation sequence $y_{1:T}$, we want to find the MAP action labels $S_{1:T}$ and the boundaries defined
by $D_{1:T}$; we are also interested in the style of actions, which can be revealed from $Z_{1:T}$. To obtain these MAP estimates,
we are required to find the posterior $p(S_{1:T}, D_{1:T}, Z_{1:T} \mid y_{1:T})$, which is a non-trivial job given the complicated hierarchy
and nonlinearity of our model. We propose to use particle filtering [ ] for online inference due to its capability in non-linear
scenarios. Moreover, the latent variable $X_t$ can be marginalized by Rao-Blackwellisation [8]. In this way, the computation
of particle filtering is significantly reduced since Monte Carlo sampling is only conducted in the joint space of $(S_t, D_t, Z_t)$,
which has a much lower dimension and a highly compact support (note the sparse transition probability between these
variables).

Formally, we decompose the posterior distribution of all the hidden nodes at time $t$ as

$$p(S_t, D_t, Z_t, X_t \mid y_{1:t}) = p(S_t, D_t, Z_t \mid y_{1:t}) \, p(X_t \mid S_t, D_t, Z_t, y_{1:t}) \tag{21}$$

where the second term can be evaluated analytically because $X_t$ depends on the other variables through linear and Gaussian
relations. In the Rao-Blackwellised particle filter [10], a set of $N_P$ samples $\{(s_t^{(n)}, d_t^{(n)}, z_t^{(n)})\}_{n=1}^{N_P}$ and the associated weights
$\{w_t^{(n)}\}_{n=1}^{N_P}$ are used to approximate the intractable first term, while the second term is represented by $\{\chi_t^{(n)}\}_{n=1}^{N_P}$, which are
analytical distributions of $X_t$ conditioned on the corresponding samples:

$$\chi_t^{(n)}(X_t) \triangleq p(X_t \mid s_t^{(n)}, d_t^{(n)}, z_t^{(n)}, y_{1:t}) \tag{22}$$

In our model, $\chi_t^{(n)}(X_t) = \mathcal{N}(X_t; \hat{x}_t^{(n)}, P_t^{(n)})$ is a Gaussian distribution. Thus, the posterior can be represented as

$$p(S_t, D_t, Z_t, X_t \mid y_{1:t}) \approx \sum_{n=1}^{N_P} w_t^{(n)} \, \delta_{S_t}(s_t^{(n)}) \, \delta_{D_t}(d_t^{(n)}) \, \delta_{Z_t}(z_t^{(n)}) \, \chi_t^{(n)}(X_t) \tag{23}$$

where the approximation error approaches zero as $N_P$ increases to infinity.
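Given the samples $\{s_{t-1}^{(n)}, d_{t-1}^{(n)}, z_{t-1}^{(n)}, \chi_{t-1}^{(n)}\}$ and weights $\{w_{t-1}^{(n)}\}$ at time $t - 1$, the posterior of $(S_t, D_t, Z_t)$ at time $t$ is:

$$p(S_t, D_t, Z_t \mid y_{1:t}) \approx \sum_n w_{t-1}^{(n)} \, p(S_t \mid D_t, s_{t-1}^{(n)}) \, p(Z_t \mid S_t, D_t, z_{t-1}^{(n)}) \, \mathcal{L}_t^{(n)}(S_t, D_t, Z_t) \tag{24}$$

where

$$\mathcal{L}_t^{(n)}(S_t, D_t, Z_t) = \int p(y_t \mid x_{t-1}, S_t, Z_t) \, \chi_{t-1}^{(n)}(x_{t-1}) \, p(D_t \mid s_{t-1}^{(n)}, d_{t-1}^{(n)}, z_{t-1}^{(n)}, x_{t-1}) \, dx_{t-1}$$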



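The detailed derivation is shown in Appendix C. It is hard to draw samples from Eq. (24). Instead, we draw new samples
$(s_t^{(n)}, d_t^{(n)}, z_t^{(n)})$ from a proposal density defined as:

$$q(S_t, D_t, Z_t \mid \cdot) = p(S_t \mid D_t, s_{t-1}^{(n)}) \, p(Z_t \mid S_t, D_t, z_{t-1}^{(n)}) \, p(D_t \mid s_{t-1}^{(n)}, d_{t-1}^{(n)}, z_{t-1}^{(n)}, \hat{x}_{t-1}^{(n)}) \tag{25}$$

The new sample weights are then updated as follows:

$$w_t^{(n)} \propto w_{t-1}^{(n)} \, \frac{\mathcal{L}_t^{(n)}(s_t^{(n)}, d_t^{(n)}, z_t^{(n)})}{p(d_t^{(n)} \mid s_{t-1}^{(n)}, d_{t-1}^{(n)}, z_{t-1}^{(n)}, \hat{x}_{t-1}^{(n)})} \tag{26}$$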



$\mathcal{L}_t^{(n)}(\cdot)$ is essentially the integral of a Gaussian function with a logistic function. Although not solvable analytically, it can
be well approximated by a re-parameterized logistic function according to [24]. Details on how to evaluate $\mathcal{L}_t^{(n)}(\cdot)$ can be
found in Appendix D.

Once we get $s_t^{(n)}$ and $z_t^{(n)}$, $\chi_t^{(n)}$ is simply updated by the Kalman filter. Re-sampling and normalization procedures are
applied after all the samples are updated as in [6].
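
The weight update, normalization and re-sampling steps can be sketched as below. The re-sampling scheme is not specified above, so multinomial re-sampling is used here purely as an assumption; the function names are hypothetical, and the likelihood and proposal values are taken as given inputs.

```python
import random


def update_weight(w_prev, likelihood, proposal_d):
    """Unnormalised weight: previous weight times likelihood over proposal, as in Eq. (26)."""
    return w_prev * likelihood / proposal_d


def normalize(weights):
    """Scale the weights so that they sum to one."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("all particle weights are zero")
    return [w / total for w in weights]


def resample(particles, weights, rng):
    """Multinomial re-sampling (assumed scheme); returns particles with equal weights."""
    n = len(particles)
    idx = rng.choices(range(n), weights=weights, k=n)
    return [particles[i] for i in idx], [1.0 / n] * n


# Each particle is a hypothetical (s, d, z) triple.
particles = [(0, 3, 1), (1, 1, 0)]
weights = normalize([update_weight(0.5, 0.2, 0.4), update_weight(0.5, 0.6, 0.4)])
particles, weights = resample(particles, weights, random.Random(0))
```

Because only the discrete triple is sampled, each resampled particle also carries along its own Gaussian over the continuous state, which the Kalman filter updates separately.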

5. Experimental Results 

Our model is tested on four datasets for continuous action recognition. In all the experiments, we have used parameters
$N_Q = 3$, $N_Z = 5$, $N_P = 200$. First, the STM is trained independently for each action using the segmented sequences in the
training set; then the DBM is learned from the inferred terminal stage of each sequence. The overall learning procedure follows the
EM paradigm, where the beginning and terminating stages are initially set as the first and last 15% of each sequence, and
the initial action primitives are obtained from K-means clustering. In testing, after the online inference using the particle filter,
we further adjust each action boundary using an off-line inference within a local neighborhood of length 40 centered at the 
initial boundary; in this way, the locally "full" posterior in Sec. 4 is considered. We evaluate the recognition performance by 
per-frame accuracy. Contribution from each model component (STM and DBM) is analyzed separately. 
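
Per-frame accuracy is the fraction of frames whose predicted action label matches the ground truth. A minimal sketch, with a hypothetical function name and toy labels, is:

```python
def per_frame_accuracy(predicted, truth):
    """Fraction of frames whose predicted label equals the ground-truth label."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("label sequences must be non-empty and of equal length")
    hits = sum(1 for p, t in zip(predicted, truth) if p == t)
    return hits / len(truth)


# Toy example: a boundary placed two frames late costs two frames out of eight.
truth = ["pick-up"] * 4 + ["put-down"] * 4
predicted = ["pick-up"] * 6 + ["put-down"] * 2
accuracy = per_frame_accuracy(predicted, truth)
```

The measure penalises misplaced boundaries in proportion to their offset, so it reflects both classification and segmentation quality.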

5.1. Public Dataset 

The first public dataset used is the IXMAS dataset [ ]. The dataset contains 11 actions, each performed 3 times by 10
actors. The videos are acquired using 5 synchronized cameras from different angles, and the actors freely changed their
orientation during acquisition. We calculate dense optical flow in the silhouette area of each subject, from which
Locality-constrained Linear Coding (LLC) features [35] are extracted as the observation in each frame. We have used 32 codewords
and 4 x 4, 2 x 2 and 1 x 1 spatial pyramids [19]. Table 1 reports the continuous action recognition results, in comparison
with SLDS [28], CRF [18] and LDCRF [26]. Our proposed model (and each of its components) achieves a recognition
accuracy higher than all the other methods by roughly 10 percentage points or more (9.6 points for STM alone over CRF).

Table 1. Continuous action recognition for IXMAS dataset

SLDS     CRF      LDCRF    STM      DBM      STM+DBM
53.6%    60.6%    57.8%    70.2%    74.5%    76.5%



The second public dataset used is the CMU MoCap dataset. For comparison purposes, we report the results from the
complete subset of subject 86. The subset has 14 sequences with 122 actions in 8 categories. Quaternion features are derived from
the raw MoCap data as our observation for inference. Table 2 lists the continuous action recognition results, in comparison
with the same set of benchmark techniques as in the first experiment, as well as [30, 31]. Similarly, results from this
experiment demonstrate the superior performance of our method. It is interesting to note that, in Table 2, the frame-level
accuracy of DBM alone is a little higher than that of its combination with STM. This is because there is only one subject in
this experiment and no significant variation in substructure is present in each action type, so temporal duration plays a
more important role in recognition. Nevertheless, the result attained by STM+DBM is superior to all benchmark methods.

Table 2. Continuous action recognition for CMU MoCap dataset

SLDS     CRF      LDCRF    [ ]      [ ]      STM      DBM      STM+DBM
80.0%    77.2%    82.5%    72.3%    90.9%    81.0%    93.3%    92.1%

Footnote: implementation based on BNT from http://code.google.com/p/bnt/

Figure 4. Example frames from the "stacking" dataset. Top row: RGB images; bottom row: aligned depth images.

Figure 5. Example frames from the "assembling" dataset. Top row: RGB images; bottom row: aligned depth images.

5.2. In-house dataset 

In addition to the above two public datasets, two in-house datasets were also captured. The actions in these two sets
feature stronger hierarchical substructure. The first dataset contains videos of stacking/unstacking three colored boxes, which
involves the actions "move-arm", "pick-up" and "put-down". 13 sequences with 567 actions were recorded in both RGB and
depth videos with one Microsoft Kinect sensor (Fig. 4). Then object tracking and 3-D reconstruction were performed to
obtain the 3D trajectories of two hands and three boxes. In this way an observation sequence in $\mathbb{R}^{15}$ is generated. In the
experiments, leave-one-out cross-validation was performed on the 13 sequences. The continuous recognition results are listed
in Table 3. It is noticed that, among the four benchmark techniques, the performance of SLDS and CRF is comparable,
while LDCRF achieved the best performance. This is reasonable because during the stacking process, each box can be
moved/stacked at any place on the desk, which leads to large spatial variations that cannot be well modeled by a Bayesian
Network of only two layers. LDCRF applied a third layer to capture such "latent dynamics", and hence achieved the best
accuracy. For our proposed models, the STM alone brings LDS to an accuracy comparable to LDCRF because it also models
the substructure transition pattern. By further incorporating duration information, our model outperforms all other existing
approaches.

The second in-house dataset is more complicated than the first one. It involves five actions, "move-arm", "pick-up", 
"put-down", "plug-in" and "plug-out", in a printer part assembling task (Fig. 5). The 3D trajectories of two hands and two 
printer parts were extracted using the same Kinect sensor system. 8 sequences were recorded and tested with leave-one-out 
cross-validation. As can be seen from Table 4, our proposed model with both STM and DBM outperforms other benchmark 
approaches by a large margin. 

Table 3. Continuous action recognition for Set I: Stacking

SLDS     CRF      LDCRF    STM      DBM      STM+DBM
64.4%    79.6%    90.3%    88.5%    81.3%    94.4%

Footnote: http://www.xbox.com/kinect
Figure 6. Continuous recognition for in-house datasets

Table 4. Continuous action recognition for Set II: Assembling

SLDS     CRF      LDCRF    STM      DBM      STM+DBM
68.2%    77.7%    88.5%    88.7%    69.0%    92.9%



5.3. Discussion 

To provide a more insightful comparison between the proposed algorithm and the benchmark algorithms, we show two 
examples of continuous action recognition results from the in-house datasets in Fig. 6. The result given by SLDS contains 
short and frequent switches between incorrect action types, caused by false matching of motion patterns to an 
incorrect action model. dSLDS [ ] and LDCRF eliminate the short transitions by considering additional context information; 
however, their performance degrades severely around noisy or ambiguous action periods (e.g. the beginning of the sequence 
in Fig. 6(b)) due to a false duration prior or overdependence on the discriminative classifier. Our proposed STM+DBM approach 
does not suffer from any of these problems, because STM helps to identify all action classes regardless of their variations, 
and DBM further improves the precision of boundaries with both generative and discriminative duration knowledge. 
Another interesting finding, shown in the last rows of (a) and (b), is that the substructure node Z can be given a concrete 
physical meaning. For all the actions in these experiments, we find that each object involved in an action corresponds to a 
different value of Z, which dominates the inferred values $Z_{1:t}$ in that action. Therefore, in addition to estimating the 
action class, we can also find the object associated with the action by majority voting over $Z_{1:t}$. In our experiments, 
all the inferred object associations agree with the ground truth. 
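
The majority-voting step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `associated_object`, the example substructure values, and the mapping from values of Z to object labels are all hypothetical and do not come from the experiments.

```python
from collections import Counter


def associated_object(z_values, z_to_object):
    """Return the object whose substructure value occurs most often in z_values."""
    if not z_values:
        raise ValueError("z_values must be non-empty")
    # most_common(1) returns a one-element list [(value, count)].
    dominant_z, _count = Counter(z_values).most_common(1)[0]
    return z_to_object[dominant_z]


# Hypothetical inferred substructure values within one recognized action segment.
z_segment = [2, 2, 1, 2, 2, 2, 3, 2]
# Hypothetical mapping from substructure value to object label.
z_to_object = {1: "box A", 2: "box B", 3: "box C"}
print(associated_object(z_segment, z_to_object))  # prints "box B"
```

Because the vote is taken over the whole segment, a few frames with a different inferred value of Z do not change the associated object.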

6. Conclusion and Future Work 

In this paper, we introduce an improved SSM with two added layers modeling the substructure transition dynamics and 
duration distribution of human actions. The first layer encodes the sparse and global temporal transition structure of action 
primitives while preserving action variations. The second layer injects discriminative information into a logistic duration 
model and discovers action boundaries more adaptively. We design a Rao-Blackwellised particle filter for efficient inference. 
Our comprehensive experimental results validate the effectiveness of both layers of our model in continuous action 
recognition. As future work, we plan to apply our model to actions in less constrained scenarios and to use more advanced 
low-level descriptors to deal with unreliable observations. 



A. Derivation of Eq.(11) 

With the constraint of Eq.(8), we have 
Equivalently, 



p{Qf^l\Qf^)= E Pi^U^l)^ Vz,j (28) 
Eq.(27) shows that we can eliminate Q from the substructure transition model, which results in the simplified objective 
function in Eq.(11). Eq.(28) leads to the equality constraint in Eq.(11). 

B. Derivation of Eq.(12) 

From the KKT conditions, we have: 
Note that $\phi_{qr} > 0$ and $\sum \phi_{qr} = 1$, and we obtain the second equation in Eq.(12). 



C. Derivation of Eq.(24) and (24) 
Denote $\mathcal{X}_t = \{S_t, D_t, Z_t, x_t\}$, and we have: 
Taking the integral with respect to $x_t$, we get: 
Eq.(24) and (24) can be obtained by replacing the inner integral with $p(y_t \mid x_{t-1}, S_t, Z_t)$. 

D. Evaluation of Eq.(24) 

From Eq.(1) and Eq.(2), we have: 
which leads to: 
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where fiy = B^^'A^^Xt_i, Ey = B^^'Q^^B^^'^ + R^^ and A = B^^'A^^'. We also define, fix = ^[% ^x = ^[% and 
have: 
The above two Gaussian distributions, i.e. the first two terms in the integral of Eq.(24), can be combined into a single 
Gaussian of $x_{t-1}$. Omitting all subscripts, the product of the exponential terms is: 
Therefore, the product of the two Gaussians is: 
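
The combination step relies on the standard identity for a linear-Gaussian likelihood multiplied by a Gaussian prior. The sketch below checks that identity numerically in one dimension. The scalar names (`a`, `var_y`, `mu_x`, `var_x`) are illustrative stand-ins for the matrices and vectors of this appendix and are not the paper's notation.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def combine_gaussians(y, a, var_y, mu_x, var_x):
    """Rewrite N(y; a*x, var_y) * N(x; mu_x, var_x) as scale * N(x; mean, var)."""
    precision = 1.0 / var_x + a * a / var_y      # precisions add
    var = 1.0 / precision
    mean = var * (mu_x / var_x + a * y / var_y)  # precision-weighted mean
    # The factor that does not depend on x is the marginal density of y.
    scale = normal_pdf(y, a * mu_x, var_y + a * a * var_x)
    return mean, var, scale


# Illustrative values only.
mean, var, scale = combine_gaussians(y=0.7, a=1.5, var_y=0.4, mu_x=-0.2, var_x=2.0)
for x in (-1.0, 0.0, 0.3, 2.0):
    product = normal_pdf(0.7, 1.5 * x, 0.4) * normal_pdf(x, -0.2, 2.0)
    print(abs(product - scale * normal_pdf(x, mean, var)) < 1e-12)
```

In the multivariate case the same structure holds with matrix inverses in place of the reciprocals, which is why the product remains a single Gaussian in the state variable.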
The third term in the integral of Eq.(24), defined in Eq.(17), can be rewritten as: 
where /3 = -Vkid - h), ^ = -^ki = 11^11"^' and T{x; a) = -^^Jl^/^ is the logistic (or Fermi) function. The probability for 
p{Dl\-) can be obtained accordingly. 

To convert the integral in Eq.(24) into a single-variable integral, we further introduce a linear transformation: 

$v = W^\top x$ 

where $W$ is orthonormal, i.e. $W^\top W = I$, and $W(:, 1) = \omega$. For a Gaussian variable, we have: 
Therefore, 
Now we are ready to evaluate Eq.(24) as: 
where the approximation follows from [24]. 
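
The approximation of [24] is not reproduced in this appendix. As an illustration only, the sketch below uses a widely known closed form for the expectation of a logistic function under a scalar Gaussian, $E[\sigma(v)] \approx \sigma(\mu / \sqrt{1 + \pi s^2 / 8})$, and compares it with numerical integration. Whether this is exactly the approximation meant by [24] is an assumption, and all function names are hypothetical.

```python
import math


def logistic(v):
    """Numerically stable logistic function 1 / (1 + exp(-v))."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def expected_logistic_numeric(mu, var, n=20001, width=10.0):
    """Trapezoidal estimate of E[logistic(v)] for v ~ N(mu, var)."""
    sd = math.sqrt(var)
    lo = mu - width * sd
    step = 2.0 * width * sd / (n - 1)
    total = 0.0
    for i in range(n):
        v = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end points
        pdf = math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
        total += weight * logistic(v) * pdf
    return total * step


def expected_logistic_approx(mu, var):
    """Closed-form approximation: logistic(mu / sqrt(1 + pi * var / 8))."""
    return logistic(mu / math.sqrt(1.0 + math.pi * var / 8.0))


# Illustrative values only.
for mu, var in [(0.0, 1.0), (1.0, 4.0), (-2.0, 0.5)]:
    print(mu, var, expected_logistic_numeric(mu, var), expected_logistic_approx(mu, var))
```

A closed form of this kind avoids a numerical integration for every particle and time step, which is the practical reason for reducing Eq.(24) to a single-variable integral first.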